This research aimed to measure the effectiveness of Thai news headline
classification using an artificial neural network (ANN). The dataset consisted
of 1,200 headlines in four classes: i) political news, ii) sports news,
iii) economic news, and iv) crime news. Features were selected using chi-square,
information gain, and term frequency inverse class frequency (TFICF). A threshold
value was set for the terms of the headlines, and cross-validation was then used
to partition the data and examine the efficiency of the neural network model in
classifying the headlines. The investigation revealed that 15-fold cross-validation
with TFICF was the most accurate configuration, with an accuracy of 99.60% and an
F-measure of 99.05%. Moreover, the classification became more accurate as more
headlines were provided as learning data. Likewise, determining an appropriate
threshold value facilitated the selection of appropriate features in the headlines
and resulted in more effective and accurate classification. Hence, it can be
concluded that headline classification is more accurate when an appropriate amount
of learning data is available and an appropriate threshold value is set.

Keywords: Classification, Information gain, TFICF

1. INTRODUCTION

Advances in information technology have led to greater use of information in electronic formats,
including news and other documents. Electronic information continuously increases in volume, which
makes it challenging to search or classify, and a large volume of documents reduces the accuracy and
speed of searching. Previous studies have focused mainly on the classification of documents in English;
research on the classification of documents in Thai is scarce. More importantly, Thai has unique
characteristics compared with other languages: there is no space between words in the written form, so
word boundaries can be ambiguous [1]. This has a negative impact on the effectiveness and accuracy of
document classification.

In response to this challenge, machine learning can be applied for text clustering or text
classification. Text clustering is an unsupervised learning method [2] in which documents are grouped
according to their content [3], so that documents with similar characteristics are placed together. In
contrast, text classification is a supervised learning method that depends on training [4], [5] to classify
documents according to their content, using word attributes as features. Text classification requires text
mining, the process of extracting and analyzing texts in a large database [6] to discover patterns or
features of text in unstructured or structured natural-language documents [7], integrating natural language
processing with machine learning to serve different purposes.

Accordingly, this study aims to investigate the efficiency of Thai news headline classification using
chi-square, information gain, term frequency inverse class frequency (TFICF), and a threshold value
derived from the mean frequency of the terms in each category, with an artificial neural network (ANN) as
the classifier. The findings will support the efficient selection of news headline keywords, which will
result in high-quality classification, and will provide a guideline for effective news headline
classification.


2. METHOD 
2.1. Data mining 

Data mining refers to the process of extracting data from a large database [8] to look for patterns [9]
or for useful and interesting information that can be used in making predictions. The steps involved
include i) data cleaning, the process of removing useless data from the database, ii) data integration, the
process of compiling data, iii) data transformation, the process of transforming data into a form suitable
for analysis, iv) data selection, the process of selecting useful data for analysis, v) data mining, the
process of using data to create a model, vi) evaluation of patterns, the process of evaluating the model,
and vii) knowledge presentation, the process of presenting the results obtained from the model [10], [11].


2.2. Classification 

Classification is a supervised machine learning technique in which existing data is used to create a
model that predicts the class of new data [12]. Specifically, part of the data is used as training data to
create the model [13], while the remaining part, known as testing data, is used to test the efficiency of
the model.


2.3. Steps of model development 
The classification of Thai news headlines using neural network algorithms in this research involves
eight steps, shown in Figure 1, as follows: i) data collection, ii) data preprocessing, iii) feature
selection, iv) feature weighting, v) vector space model (VSM), vi) cross-validation, vii) classification,
and viii) measuring model performances.


[Figure 1 flowchart: Data Collection, Data Preprocessing, Feature Weighting, Vector Space Model,
Cross-validation, Classification (Artificial Neural Network), Measuring Model Performances]

Figure 1. Steps of model development


2.4. Data collection

Thai news headlines from the following Thai news agencies were collected: i) www.thairath.co.th, 
ii) www.khaosod.co.th, iii) www.dailynews.co.th, and iv) www.matichon.co.th. There were 1,200 news 
headlines in total. They were divided into four classes: i) political news, ii) sports news, iii) economic news, 
and iv) crime news. 


2.5. Data preprocessing 

This step prepares the data before the headline classification performance test. It involves three
sub-steps: i) word segmentation, the process of dividing Thai text into words before the natural language
processing in the following steps. In written Thai, words are joined together without spaces or punctuation
marks, in contrast to English, where words are separated discretely; ii) stop word removal, the process of
removing common words that are considered insignificant [14]-[16] from documents. These words occur
frequently and are used to connect sentences or complete texts, so they are not useful in document
classification and must be removed [17]. They include pronouns, adverbs, interjections, prepositions,
conjunctions, and symbols such as !, #, +, -, *, /, and = [18]; and iii) word stemming, the process of
replacing words that have the same meaning or the same root with a single word. This reduces the number of
redundant words in the document and helps increase the efficiency of document classification [19]-[21].
Figure 2 shows how each sub-step is carried out: word segmentation cuts words by comparing the text with
words archived in a dictionary, stop word removal compares each word with the stop words stored in a
database, and word stemming replaces words by comparing them with a root word database.


[Figure 2 flowchart: Stop Word Removal uses the Stop Word Database; Stemming Word uses the Root Word
Database]

Figure 2. Steps of data preprocessing
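
The three preprocessing sub-steps can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes greedy longest matching for the dictionary lookup, and the names DICTIONARY, STOP_WORDS, and ROOT_WORDS, together with their toy Latin-script entries standing in for unspaced Thai text, are hypothetical.

```python
# Toy resources; every entry is invented for illustration only.
DICTIONARY = {"the", "team", "wins", "win", "cup", "final"}
STOP_WORDS = {"the"}
ROOT_WORDS = {"wins": "win"}  # maps a variant to its root word


def segment(text, dictionary):
    """Split unspaced text by greedy longest matching against a dictionary."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: keep one character as a token.
            tokens.append(text[i])
            i += 1
    return tokens


def preprocess(text):
    """Segment, drop stop words, then replace variants with their root word."""
    tokens = segment(text, DICTIONARY)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [ROOT_WORDS.get(t, t) for t in tokens]


print(preprocess("theteamwinsthecupfinal"))
# ['team', 'win', 'cup', 'final']
```

Greedy longest matching is only one possible segmentation strategy; it can split ambiguous strings incorrectly, which is the ambiguity problem noted for written Thai.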


2.6. Feature selection 

Feature selection involves selecting the significant word attributes as a feature subset of all
document attributes by eliminating duplicate, overlapping, and irrelevant attributes from the document [22].
This reduces the number of features and helps select the word attributes that are important for classifying
the document [23]. In this research, word attribute selection was performed using the chi-square,
information gain, and TFICF methods, which calculate the weights of terms.


2.7. Chi-square 

Chi-square is a statistical method used to examine the correlation between word features and
document categories. Terms are selected based on the frequency of term $t_k$ and the probability of its
occurrence in document group $c_i$ [24], as shown in (1) [25].

$$\chi^2(t_k, c_i) = \frac{N(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{1}$$

Where,

N is the total number of documents

A is the number of documents in group $c_i$ in which the term $t_k$ appears

B is the number of documents in other groups in which the term $t_k$ appears

C is the number of documents in group $c_i$ in which the term $t_k$ does not appear

D is the number of documents in other groups in which the term $t_k$ does not appear
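
As a minimal sketch of the chi-square score in (1), the function below takes the four document counts A, B, C, and D directly. It assumes the standard 2x2 contingency form of the denominator, (A + C)(B + D)(A + B)(C + D); the function name and the example counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one class from 2x2 document counts.

    a: documents in the class that contain the term
    b: documents in other classes that contain the term
    c: documents in the class that do not contain the term
    d: documents in other classes that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # an empty margin means the term carries no signal
    return n * (a * d - c * b) ** 2 / denominator


# A term spread evenly across classes scores 0; a term that appears in
# exactly the documents of one class scores N.
print(chi_square(10, 30, 90, 270))  # 0.0
print(chi_square(50, 0, 0, 50))     # 100.0
```

Terms with higher scores are more strongly associated with the class and are therefore better candidates for features.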
2.8. Information gain

Information gain is one of the most widely used methods for selecting keyword features from a group
of documents. The selection is based on entropy theory [26] and uses the probability of documents in
category i given that the term t appears and given that it does not appear, as shown in (2) [27].

$$IG(t) = -\sum_{i} P(c_i)\log P(c_i) + P(t)\sum_{i} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}) \tag{2}$$

Where,

$c_i$ is the i-th group of documents

$P(c_i)$ is the probability of a document in group i

$P(t)$ and $P(\bar{t})$ are the probabilities that the term t appears and does not appear in the document

$P(c_i \mid t)$ is the conditional probability of group i given that the term t appears in the document

$P(c_i \mid \bar{t})$ is the conditional probability of group i given that the term t does not appear in the
document
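
The information gain in (2) can be estimated from document counts, as in the sketch below. It is an illustration under stated assumptions: probabilities are estimated as relative frequencies, the logarithm is taken to base 2 (the base is not specified in the text), and the function names and example counts are hypothetical.

```python
import math


def sum_p_log_p(probabilities):
    """Sum of p * log2(p), skipping zero probabilities."""
    return sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(docs_with_term, docs_per_class):
    """Information gain of a term.

    docs_with_term[i]: documents of class i that contain the term
    docs_per_class[i]: total documents of class i
    """
    n = sum(docs_per_class)
    n_t = sum(docs_with_term)
    n_not_t = n - n_t
    gain = -sum_p_log_p([c / n for c in docs_per_class])
    if n_t > 0:
        p_c_given_t = [w / n_t for w in docs_with_term]
        gain += (n_t / n) * sum_p_log_p(p_c_given_t)
    if n_not_t > 0:
        p_c_given_not_t = [
            (c - w) / n_not_t for c, w in zip(docs_per_class, docs_with_term)
        ]
        gain += (n_not_t / n) * sum_p_log_p(p_c_given_not_t)
    return gain


classes = [300, 300, 300, 300]  # four classes of 300 headlines each
print(information_gain([300, 0, 0, 0], classes))        # term marks one class
print(information_gain([100, 100, 100, 100], classes))  # term is uninformative
```

A term that occurs only in one class yields a clearly positive gain, while a term spread evenly over all classes yields a gain of approximately zero.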


2.9. TFICF 

TFICF is a term weighting method that determines the correlation between a keyword and the documents
of a collection, where $f_{t,d}$ is the frequency of the term t in document d [28], [29] and the inverse
class frequency (icf) is adapted from the inverse document frequency (idf) of term frequency inverse
document frequency [30]. Here $|C|$ is the number of document categories and $cf_t$ is the number of
document categories in which the term t appears, as shown in (3) [29].

$$tficf_t = f_{t,d} \times \log\frac{|C|}{cf_t} \tag{3}$$
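
The weighting in (3) can be sketched as below: the term frequency in a document is multiplied by the logarithm of the number of categories divided by the number of categories containing the term. The natural logarithm, the function names, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def class_frequency(docs_by_class):
    """Count, for each term, the number of classes in which it appears."""
    cf = Counter()
    for docs in docs_by_class.values():
        terms_in_class = {term for doc in docs for term in doc}
        cf.update(terms_in_class)
    return cf


def tficf_weights(doc, docs_by_class):
    """TFICF weight of every term in one tokenized document."""
    cf = class_frequency(docs_by_class)
    num_classes = len(docs_by_class)
    tf = Counter(doc)
    return {
        term: freq * math.log(num_classes / cf[term])
        for term, freq in tf.items()
        if cf[term] > 0  # skip terms never seen in the corpus
    }


# Toy corpus of tokenized headlines (invented for illustration).
corpus = {
    "sports": [["team", "win", "cup"], ["team", "goal"]],
    "politics": [["vote", "win"], ["party", "vote"]],
}
print(tficf_weights(["team", "team", "win"], corpus))
```

In this toy corpus the term "win" occurs in both classes, so its weight is zero, while "team" occurs in one class only and receives a positive weight.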


2.10. Feature weighting 

Feature weighting is a process in which a weight is assigned to each feature to promote the accuracy
of document classification in a collection [31]. The weight of each term in a document is assigned
according to its attributes [32]. In this study, the average frequency of terms in each document collection
was set as the threshold value, and a term was selected as a document attribute only if its frequency was
greater than or equal to the threshold of that collection. The rationale is that a high-frequency term is
important to the document: it can represent the document and affects classification efficiency more than
low-frequency terms.
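
The threshold rule can be sketched as follows: the mean term frequency of a document collection is used as the threshold, and terms at or above it are kept. The function name and the toy headlines are hypothetical, and any rounding of the threshold is left out because the text does not state one.

```python
from collections import Counter


def select_features(class_docs):
    """Keep terms whose frequency in the class is at least the class mean."""
    freq = Counter(term for doc in class_docs for term in doc)
    if not freq:
        return 0.0, []
    threshold = sum(freq.values()) / len(freq)
    selected = sorted(term for term, f in freq.items() if f >= threshold)
    return threshold, selected


# Toy tokenized headlines of one class (invented for illustration).
sports_docs = [["team", "win"], ["team", "cup"], ["team", "win", "goal"]]
threshold, features = select_features(sports_docs)
print(threshold, features)  # 1.75 ['team', 'win']
```

Here the four distinct terms occur seven times in total, so the threshold is 1.75 and only "team" (3 occurrences) and "win" (2 occurrences) are retained.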


2.11. Vector space model 

VSM is a mathematical model used for document classification. Each of its dimensions is represented
by the weight of a word attribute in a document [33] in the matrix. A VSM is shown in (4) [34], in which
$d_j$ is the document and $w_{i,j}$ is the weight of term $t_i$ in $d_j$.

$$d_j = (t_1, w_{1,j};\ t_2, w_{2,j};\ \ldots;\ t_n, w_{n,j}) \tag{4}$$
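
A minimal sketch of the representation in (4): every document becomes a vector of term weights over a shared, fixed term order, and the vectors stacked together form the matrix. The function names and the example weights are hypothetical.

```python
def build_vocabulary(weighted_docs):
    """Sorted list of all terms; it fixes the order of the vector dimensions."""
    return sorted({term for weights in weighted_docs for term in weights})


def to_vector(weights, vocabulary):
    """Map a {term: weight} dict onto the vocabulary; absent terms get 0.0."""
    return [weights.get(term, 0.0) for term in vocabulary]


# Term weights of two toy documents (invented for illustration).
weighted_docs = [{"team": 1.4, "cup": 0.7}, {"vote": 0.7, "party": 0.7}]
vocabulary = build_vocabulary(weighted_docs)
matrix = [to_vector(weights, vocabulary) for weights in weighted_docs]

print(vocabulary)  # ['cup', 'party', 'team', 'vote']
print(matrix)      # [[0.7, 0.0, 1.4, 0.0], [0.0, 0.7, 0.0, 0.7]]
```

Each row of the matrix is one document and each column is one term, which is the input format a neural network classifier expects.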


2.12. Cross-validation 

To evaluate the effectiveness of the model, cross-validation was employed in this study to partition
the data. The key concept was to divide the data into K sets of equal size. In each round, one set, called
the testing set, was used to examine the effectiveness of the model, while the remaining sets, known as the
training set, were used in the model training process. These two steps were repeated until every set had
been used as the testing set.
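
A common K-fold scheme can be sketched as below, assuming that each of the K sets is held out once as the testing set while the remaining sets are used for training. The function name, the shuffling, and the seed are assumptions; the study does not describe how the headlines were assigned to sets.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k sets of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 1,200 headlines and K = 15, every round trains on 1,120 headlines
# and tests on the remaining 80.
for train, test in k_fold_indices(1200, 15):
    assert len(train) == 1120 and len(test) == 80
```

With K = 5 the same data gives 960 training headlines per round, so a larger K leaves more data for training in each round.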


2.13. Artificial neural network 

ANN is a mathematical model that incorporates nonlinear learning elements to mimic the function of
the human brain [35]. The key concept of the ANN is that it learns progressively. In general, an ANN
consists of three layers, as shown in Figure 3: i) the input layer, ii) the hidden layer, and iii) the
output layer [36]. Every node in a layer is connected to all nodes in the next layer [37]. Because weighted
links connect the nodes, the outputs of the ANN depend on the adjustment of the link weights. Data is sent
to the input layer, and the result is presented at the output layer [38]. Normally, the number of nodes in
the input layer depends on the number of features of the dataset to be analyzed. Since the hidden layer
affects the learning ability of the model, the number of nodes in this layer depends on the needs of users,
and more nodes can be added to improve accuracy. Meanwhile, the output layer is where the outputs of the
ANN are presented [38]. The number of nodes in this layer depends on the format of the data to be
classified.
[Figure 3 network diagram: input nodes labeled Attribute i, Attribute m, and Attribute n; a node labeled
Node 0; layers labeled Input layer, Hidden layer, and Output layer]

Figure 3. Structure of the neural network
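
The flow of data through the three layers can be sketched with a single forward pass. This is an illustration only: the sigmoid activation in the hidden layer, the softmax in the output layer, the layer sizes, and all weight values are assumptions, since the text does not specify them, and training of the weights is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Weighted sum for every node of a fully connected layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(values):
    """Turn output scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer (sigmoid) -> output layer (softmax)."""
    hidden = [sigmoid(z) for z in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, output_w, output_b))


# Three input features, two hidden nodes, four output classes.
features = [1.0, 0.0, 2.0]
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [0.2, 0.3], [-0.5, 0.5], [0.0, 0.0]]
output_b = [0.0, 0.0, 0.0, 0.0]

probabilities = forward(features, hidden_w, hidden_b, output_w, output_b)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The number of input nodes matches the number of features, and the four output nodes correspond to the four news classes; the class with the highest probability is taken as the prediction.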


2.14. Measuring model performances 
The effectiveness of the model was measured in terms of i) precision, ii) recall, iii) F-measure, and
iv) accuracy, computed using (5)-(8) [39].

$$Precision = \frac{TP}{TP + FP} \tag{5}$$

$$Recall = \frac{TP}{TP + FN} \tag{6}$$

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{7}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$


Here, TP is the number of documents correctly predicted as Class C, TN is the number of documents
correctly predicted as not Class C, FP is the number of documents incorrectly predicted as Class C, and FN
is the number of documents incorrectly predicted as not Class C.
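
The four measures in (5)-(8) can be computed directly from the counts TP, TN, FP, and FN, as in the sketch below. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F-measure, and accuracy for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Invented counts for one class out of 100 test documents.
print(classification_metrics(tp=40, tn=40, fp=10, fn=10))
```

Because accuracy also counts true negatives, it can remain high for a class even when recall is low, which is why F-measure is reported alongside it.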


3. RESULTS AND DISCUSSION 

The headline classification performance was tested with term weighting by information gain,
chi-square, and TFICF, using the term frequency in each group of documents as the threshold value. The
thresholds were as follows: threshold=4 for political news, threshold=4 for sports news, threshold=3 for
economic news, and threshold=5 for crime news. The data was then partitioned by cross-validation in three
ways: 5-fold, 10-fold, and 15-fold. For each configuration, the accuracy, precision, recall, and F-measure
of the neural network were calculated to measure the effectiveness of the model. The results are shown in
Table 1. The performance test of Thai news headline classification using the ANN revealed that the TFICF
method with 15-fold cross-validation achieved the highest accuracy and F-measure, 99.60% and 99.05%,
respectively. This suggests that, with 10-fold and 15-fold cross-validation, the TFICF weighting method is
more accurate in classifying headlines than the other methods.

It was also found that when sports news headlines were classified with the minimum threshold value
set, the number of words serving as headline features was greater than that of the other news categories.
Consequently, the higher number of features promoted the effectiveness of headline classification. As
shown in Figure 4, when the weights of terms were calculated using information gain, chi-square, and TFICF
with 15-fold cross-validation, higher precision and F-measure scores were found regardless of the method.
Therefore, it can be concluded that the amount of learning data affects the efficiency and accuracy of news
headline classification, since a larger number of folds leaves more headlines for training in each round.


Table 1. The results of the effectiveness test of Thai news headlines classification using the ANN (all
scores in %)

Cross-validation  Feature selection  Accuracy  Precision  Recall  F-measure
5-fold            Information gain   87.98     97.35      71.75   82.61
5-fold            Chi-square         89.80     81.25      74.09   77.50
5-fold            TFICF              87.90     70.62      71.06   70.84
10-fold           Information gain   95.21     96.91      88.00   92.24
10-fold           Chi-square         97.91     92.04      98.18   95.01
10-fold           TFICF              99.21     96.85      99.35   98.08
15-fold           Information gain   99.09     98.23      98.67   98.45
15-fold           Chi-square         99.01     98.86      96.66   97.75
15-fold           TFICF              99.60     99.37      98.75   99.05

[Figure 4 line chart: Accuracy and F-measure (0 to 100) for IG, Chi-Square, and TFICF under 5-fold,
10-fold, and 15-fold cross-validation]

Figure 4. The efficiency of news headlines classification


4. CONCLUSION 

This research aimed to test the efficiency of news headline classification using the information
gain, chi-square, and TFICF methods to determine the weights of terms. Cross-validation was applied to
partition the data and evaluate the effectiveness of the model, while neural network algorithms were
applied to classify the headlines. It was found that the efficiency and accuracy of the classification
depend greatly on the amount of learning data and the number of terms set as headline features. When the
learning data is adequate and the terms functioning as features are able to represent the document, news
headline classification is more effective.